12/3/2020
Synthetic data can be generated with mice by overimputing every cell of the observed data:
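# load mice (mice(), make.predictorMatrix()) and magrittr (%>%)
library(mice)
library(magrittr)
# synthesize every variable with CART, except bmi, which is derived
# deterministically from the synthetic wgt and hgt (passive imputation)
cart <- rep("cart", ncol(dat))
names(cart) <- colnames(dat)
cart["bmi"] <- "~I(wgt / (hgt/100)^2)"
# do not use bmi as a predictor of wgt and hgt, to avoid circularity
pred <- make.predictorMatrix(dat)
pred[c("wgt", "hgt"), "bmi"] <- 0
# where = TRUE for every cell, so all values are replaced by synthetic ones
syns <- dat %>% mice(m = 5,
                     method = cart,
                     predictorMatrix = pred,
                     where = matrix(TRUE, nrow(dat), ncol(dat)),
                     printFlag = FALSE,
                     seed = 123)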
On the right is a plot of the distribution of age, with the observed data in red and the synthetic data, averaged over the five synthetic datasets, in blue.
To make correct inferences from the synthetic data, we need to pool the results from the \(m\) synthetic datasets with the appropriate combining rules. For fully synthetic data, for instance, we could use
\[ \begin{aligned} \bar{q}_m &= \frac{1}{m}\sum_{i=1}^{m} q^{(i)}, \\ b_m &= \sum_{i=1}^{m} \frac{(q^{(i)} - \bar{q}_m)^2}{m-1}, \end{aligned} \]
\[ \begin{aligned} \bar{u}_m &= \frac{1}{m} \sum_{i=1}^{m} u^{(i)}, \\ T_f &= \left(1 + \frac{1}{m}\right) b_m - \bar{u}_m, \end{aligned} \]
with \(q^{(i)}\) the estimate obtained from the \(i\)-th synthetic dataset, \(u^{(i)}\) the variance estimate of \(q^{(i)}\) within that dataset, \(\bar{q}_m\) the mean of the estimates, \(b_m\) the between-dataset variance, \(\bar{u}_m\) the average within-dataset variance and \(T_f\) the total variance of the estimate, as proposed by Raghunathan, Reiter, & Rubin (2003).
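As a minimal sketch, the combining rules above can be written in a few lines of base R. The function name `pool_fully_synthetic` and the input numbers are hypothetical and made up for illustration; they are not part of mice or of the analysis shown here. The function assumes that a scalar estimate \(q^{(i)}\) and its variance \(u^{(i)}\) have already been obtained from each of the \(m\) synthetic datasets.

```r
# Pool a scalar estimate over m fully synthetic datasets, following the
# combining rules of Raghunathan, Reiter, & Rubin (2003).
# NOTE: pool_fully_synthetic is a hypothetical helper, not a mice function.
#   q: vector of m point estimates q^(i)
#   u: vector of m variance estimates u^(i) (squared standard errors)
pool_fully_synthetic <- function(q, u) {
  stopifnot(length(q) == length(u), length(q) > 1)
  m     <- length(q)
  q_bar <- mean(q)                      # mean of the estimates
  b_m   <- var(q)                       # between-dataset variance (divisor m - 1)
  u_bar <- mean(u)                      # average within-dataset variance
  T_f   <- (1 + 1 / m) * b_m - u_bar    # total variance of the estimate
  list(q_bar = q_bar, b_m = b_m, u_bar = u_bar, T_f = T_f)
}

# Made-up estimates and variances from m = 5 synthetic datasets
q <- c(10.2, 9.8, 10.5, 9.9, 10.1)
u <- c(0.04, 0.05, 0.04, 0.05, 0.04)
pooled <- pool_fully_synthetic(q, u)
# q_bar = 10.1, b_m = 0.075, u_bar = 0.044, T_f = 0.046
```

Because \(\bar{u}_m\) is subtracted, \(T_f\) can come out negative when the between-dataset variance is small relative to the within-dataset variance, so the sign of the result is worth checking before taking a square root.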
Raghunathan, T. E., Reiter, J. P., & Rubin, D. B. (2003). Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1), 1–16.